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Abstract. I describe here the performances of a parallel treecode with individ- 
ual particle timesteps. The code is based on the Barnes-Hut algorithm and runs 
cosmological N-body simulations on parallel machines with a distributed memory 
architecture using the MPI message passing library. For a configuration with a 
constant number of particles per processor the scalability of the code has been 
tested up to P — 32 processors. The average CPU time per processor necessary 
for solving the gravitational interactions is within ~ 10% of that expected from 
the ideal scaling relation. The load balancing efficiency is high ( Si 90%) if the 
processor domains are determined every large timestep according to a weighting 
scheme which takes into account the total particle computational load within the 
timestep. 
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1. Introduction 

Numerical simulations play a fundamental 
role for improving the theoretical under- 
standing of structure formation. This ap- 
proach has received a large impulse from 
the huge growth of computer technology 
in the last two decades. Cosmological N- 
body simulations are now widely used as a 
fundamental tool in modern cosmology for 
testing viable theories of structure forma- 
tion. A popular approach for solving the 
gravitational forces of the system is the 



point in favor of tree codes is that individ- 
ual timesteps for all of the particles can be 
implemented easily, this allows a substan- 
tial speed-up of the force evaluation for a 
clustered distribution. 

An important task is the improvement 
of the dynamic range of the simulations. 
Large simulation volumes are required for 
statistical purposes, but at the same time 
modelling the formation and evolution of 
each individual galaxy in the simulated 
volume requires that a realistic simulation 
tree algorithm flAppel | |1985| ; |Hernquist | should be implemented with 10 s - 10 9 par- 
ticles. This computational task can be effi- 
ciently solved if the code is adapted to work 
on a parallel machine where many proces- 
sors are linked together with a communi- 
cation network. This has led a number of 



1987 ). The particle distribution of the sys- 
tem is arranged into a hierarchy of cubes 
and the force on an individual particle is 
computed by a summation over the multi- 
pole expansion of the cubes. An important 



authors to parallelize treecodes (Salmon 



Send offprint requests to: R. Valdarnini 



1991; Warren 1994: Dubinski 1996; Dave 



2 



R. Valdarnini: Parallel treecode 



Dubinski & Hernquist 


1997; 


Lia & Carraro 




2000 


; 3pringel, Yoshida & White 


2001; 




Miocchi & Capuzzo-Dolcetta 


E00^ 


). In 



this paper I present a parallel implemen- 
tation of a multistep treecode based on 
the Barnes-Hut (1986, BH) algorithm. The 
code is cosmological and uses the MPI mes- 
sage library. 



2.1. Domain decomposition 

The spatial domains of the processors are 
determined according to the orthogonal re- 
cursion bisection (ORB, Salmon 1991). The 
computational volume is first cut along the 
x-axis at a position x c such that 



Wj 



(2) 



2. Parallelization of a treecode 

The BH algorithm works by subdividing a 
root box of size L, which contains all of the 
simulation particles, into 8 subvolumes of 
size L/2. This procedure is then repeated 
for each of the subcubes and continues un- 
til the remaining cells or nodes are empty 
or have one particle. After the k — th itera- 
tion the size of the subcubes is Ik = L/2 k . 
After the tree construction is complete the 
multipole moments of the mass distribution 
inside the cells are computed starting from 
the smallest cells and proceeding up to the 
root cell. The moments of the cells are typi- 
cally approximated up to quadrupole order. 
For each particle the acceleration is evalu- 
ated by summing the contribution of all of 
the cells and particles which are in an in- 
teraction list. The list is constructed start- 
ing from the root cell and descending the 
tree down to a required level of accuracy. 
At each level a cell of the tree is accepted 
if it satisfies an accuracy criterion. If the 
cell fails this criterion then it is opened, 
the particles contained are added to the 
interaction list and the accuracy criterion 
is applied again for the remaining subcells. 
The following acceptance criterion has been 
used ([Barnes ||l99^; [Dubinski 1 11996[) 



d > lk/0 + S, 



(1) 



where d is the distance between the center 
of mass (c.o.m.) of the cell and the particle 
position, 6 is an input parameter that con- 
trols the accuracy of the force evaluation, 
and 6 is the distance between the cell c.o.m. 
and its geometrical center. 



where the summations are over all of the 
particles with Xi < x c or Xi > x c and Wi oc 
Nop{i) is a weight assigned to each par- 
ticle proportional to the number of float- 
ing point operations (i.e. the computational 
work) which are necessary to compute the 
particle force. 

When the root x c has been determined 
the particles are then exchanged between 
the processors, until all of the particles with 
Xi < x c belong to the first P/2 processors 
and those with Xi > x c are in the sec- 
ond P/2 processors. The whole procedure 
is repeated recursively, cycling through the 
cartesian dimensions, until the total num- 
ber of subdivisions of the computational 
volume is log2P ( with this algorithm P 
is constrained to be a power of two). At 
the end of the domain decomposition the 
subvolumes will enclose a subset of parti- 
cles with approximately an equal amount of 
computational work. The calculation of the 
forces is then approximately load-balanced 
among all of the processors. 

2.2. Construction of the local essential 
tree 

A BH tree is constructed by each processor 
using the particles located in the proces- 
sor subvolume. However, the local tree does 
not contain all of the information needed to 
perform the force calculation for the pro- 
cessor particles. For these particles a subset 
of cells must be imported from the trees of 
the other processors according to the open- 
ing angle criterion applied to the remote 
cells. Each processor then receives a set of 
partial trees which are merged with the lo- 
cal tree to construct a local essential tree 
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(Dubinski 1996). The new local tree con- 
tains all of the information with which the 
forces of the local particles can be consis- 
tently calculated. 

The communications between proces- 
sors of nodes from different trees implies 
that in order to graft the imported cells 
onto the processor local trees it is necessary 
to adopt an efficient addressing scheme for 
the memory location of the nodes. This is 
easily obtained if the construction of the lo- 
cal trees starts from a root box of size L, 
common to all of the processors. The main 
advantage is that now the non-empty cells 
of the local trees have the same position 
and size in all of the processors. Each cell is 
then uniquely identified by a set of integers 
{ji, j'2, ...}, with each integer ranging from 
to 7 which identifies one of the 8 subcells 
of the parent cell. These integers can be 
conveniently mapped onto a single integer 
word of maximum bit length 3k rnax , where 
kmax is the maximum subdivision level of 
the tree. For a 64 bit key k max < 21. This 
integer word represents the binary key of 
the cell. When a cell of the tree is requested 
from a remote processor to construct its lo- 
cal tree, the associated key is sent together 
with the mass, c.o.m. and multipole mo- 
ments of the cell. The receiving processor 
then uses this key to quickly identify the 
cell location in the local tree and to add 
the new cell to the local tree. A similar ad- 
dressing scheme has been implemented, in 
their version of a parallel treecode, also by 



Miocchi fc Capuzzo-Dolcctta | ( |2002j ). An 
efficient construction of the local essential 
tree is thus obtained as follows. 

i) Once the ORB has been completed 
and each processor has received the parti- 
cle subset with spatial coordinates within 
its spatial domains, the local trees are con- 
structed according to BH in each of the 
processors Pk , where k is a processor index 
ranging from to P — 1. 

ii ) The communications between pro- 
cessors can be significantly reduced if one 
adopts the following criterion to construct 
the partial trees that will be exchanged 
between processors. After the local trees 



have been constructed, each processor ap- 
plies the opening angle criterion between 
the nodes of its local tree and the closest 
point of the volume of another processor 
Pk- The partial tree obtained contains by 
definition all the nodes of the local proces- 
sor necessary to evaluate the forces of the 
particles located in the processor Pk- This 
procedure is performed at the same time 
by each processor for all of the remaining 
P — 1 processors. At the end, each proces- 
sor has P — 1 lists of nodes which are nec- 
essary for the construction of the local es- 
sential trees in the other processors. The 
processor boundaries are determined dur- 
ing the ORB and are communicated be- 
tween all of the processors after its comple- 
tion. Therefore the main advantage of this 
procedure is that all of the communications 
between processors necessary for the con- 
struction of the local essential trees are per- 
formed in a single all-to-all message pass- 
ing routine. The drawback of this scheme is 
the memory overhead, because each proces- 
sor imports from another processor a list of 
nodes in excess of those effectively needed 
to perform the force calculation. As a rule 
of thumb it has been found that for 9 = 0.4 
a processor with N p particles and N c cells 
imports ~ N p /8 — N p /A particles and ~ N c 
cells. The number of imported nodes is in- 
dependent of the processor number. The 
value = 0.4 is a lower limit that guaran- 
tees reasonable accuracy in the force evalu- 
ation in many simulations. In the communi- 
cation phase between processors mass and 
position are imported for each particle, and 
the mass, c.o.m., quadrupole moment and 
the binary key are imported for each cell. 

The memory required by a single pro- 
cessor to construct the local essential tree 
is then approximately a factor ~ 2 larger 
than that used in the implementation of 
the local tree. This memory requirement 
can be efficiently managed with dynamic 
allocation, and is not significantly larger 
than that required with other schemes used 
to construct the local essential tree (e.g., 
Dubinski 1996). 
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2.3. Force calculation 

After the construction of the local essen- 
tial trees has been completed, each pro- 
cessor proceeds asynchronously to calcu- 
late up to the quadrupole order the forces 
of the active particles in its computational 
volume. The code has incorporated peri- 
odic boundary conditions and comoving 
coordinates. Therefore the forces obtained 
from the interaction lists of the local es- 
sential trees must be corrected to take 
into account the contribution of the im- 
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)■ 



correction terms are calculated before the 
simulation using the Ewald method. The 
corrections are computed on a cubic mesh 
of size L with 50 3 grid points and stored in 
a file. During the force computation a lin- 
ear interpolation is used to calculate, from 
the stored values, the correction terms cor- 
responding to the particle positions. 

In a cosmological simulation the eval- 
uation of the peculiar forces in the lin- 
ear regime is subject to large relative er- 
rors. This is because for a nearly homoge- 
neous distribution, the net force acting on 
a particle is the result of the cancellation 
of the large partial forces determined from 
the whole mass distribution. From a set of 
test simulations Dave et al. (1997) found 
that in the linear regime, when 9 = 0.4 
and the cell moments are evaluated up to 
the quadrupole order, the relative errors in 
the forces are ,$ 7%. This problem is not 
present at late epochs, when the clustering 
of the simulation particles is highly evolved 
and even for 9 ~ 1 the relative errors in 
the forces are small ( 1%). This imposes 
in the simulation the necessity of varying 9 
according to the clustering evolution, since 
the computational cost of evaluating the 
forces with a small value of 8 is wasted in 
the non-linear regime. In this regime the 
forces can be evaluated with an accuracy as 
good as that obtained in the linear regime, 
though using an higher value of 9. 

After several tests it has been found 
that a good criterion to control the value 



of 9{t) is that at any given simulation time 
t the energy conservation must be satis- 
fied with a specified level of accuracy. The 
Lyzer-Irvine equation is 

a 4 T + aU - J Uda = C, (3) 

where a = a(t) is the expansion fac- 
tor, T is the kinetic energy of the sys- 
tem, U is the potential energy and C 
is a constant. The accuracy of the inte- 
gration can be measured by the quantity 
err(t) = |A(C)/A(aJ7)|, where Af de- 
notes the change of / with respect its ini- 
tial value. The time evolution of err(t) has 
been analyzed for different test simulations. 
The cosmological model considered is a fiat 
CDM model, with a vacuum energy den- 
sity Q\ = 0.7, matter density parameter 
Q m = 0.3 and Hubble constant h — 0.7 
in units of 100K msec~ 1 Mpc~ 1 . The power 
spectrum of the density fluctuations has 
been normalized so that the r.m.s. mass 
fluctuation in a sphere of radius %h~ 1 Mpc 
takes the value <j% = 1 &t the present epoch, 
a{t) = a fi n = 11. The simulations are run 
in a L = 200h~ 1 Mpc comoving box with 
N p = 84 3 particles. For a simulation with 
9 = const = 0.4 one has err{t) £ 10~ 3 
even in non-linear regimes, when a(t) ap- 
proaches its final value. The distribution of 
the relative root mean square errors in the 
force components of the particles can be re- 
produced if during the simulation 9 = 9{t) 
increases with time, provided that its value 
never exceeds an upper limit implicitly de- 
fined for erg > 0.2 by the constraint 

AC/A(aU) < 0.025/[l+ (0.4/crg) 3 ] 1 - 7 . (4) 

An additional constraint sets an upper 
limit 9 < 0.9. This criterion can therefore 
be profitably used to constrain the value 
of 9(t) according to the clustering evolu- 
tion, and at the same time to maintain the 
relative errors in the forces below a fixed 
threshold ( ;$ 3%). This allows a substan- 
tial increase in the code performances. The 
computational cost of evaluating the forces 
depends on 9 and for the considered runs 
at a(t) = 11 it is significantly reduced by 
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a factor 10 to 20 when 6 is increased from 
0.4 to - 0.9. 

2.4. Multiple timesteps and particle 
update 

After the force calculation is complete, par- 
ticle velocities and positions are updated in 
each processor. In the individual timestep 
scheme (Hernquist & Katz 1989] ) the par- 
ticle timestep of particle i is defined as 
Ati — At /2 rii , where rij > is an inte- 
ger. The particle timesteps are determined 
according to several criteria. The first is im- 
portant at early epochs and requires that 



At; < At exp = 0.03 2/3 H(t), 



(5) 



where H(t) is the Hubble parameter at the 
simulation time t. The other two criteria 
are 



Ati < 0.3(6, a 3 (t)/ 5l ) 1/2 
AU < 0.3(si/vi) , 



(G) 
(7) 



where £; is the comoving gravitational soft- 
ening parameter of the particle i, gi is the 
peculiar acceleration and Vi its peculiar ve- 
locity. These criteria are similar to those 
adopted by Dave et al. (1997). At the be- 
ginning of the integration t — tj„ , the forces 
are evaluated for all of the particles and 
their positions are advanced by half of the 
initial timestep, which is common to all 
the particles. In this integration scheme the 
forces are evaluated at later times t > U n 
only for those particles for which is nec- 
essary to maintain the second-order accu- 
racy of the leapfrog integrator. The parti- 
cle positions are advanced using the small- 
est timestep At m j„, as determined by the 
above constraints. In the parallel imple- 
mentation, each processor determines the 
individual particle timesteps and the small- 
est timestep At^ n of its particle subset, 
At m i n is then the smallest of these At^^ 
and is used by all the processors. 

After that particle positions have been 
updated, it may happen that a fraction 



of the particles assigned to a given pro- 
cessor have escaped the processor subvol- 
ume. At each timestep the particles that 
are not located within the original proces- 
sor boundaries are migrated between the 
processors. In principle the computational 
cost of locating the processor to which the 
escaped particle belongs scales as the pro- 
cessor number P ( Dave et al. 1997, sect. 
5.1). However, if during the ORB the pro- 
cessors are partitioned according to the 
procedure described in sect. 2.1, the final 
processor ordering makes it possible to re- 
duce to ~ log 2 P the number of positional 
tests of the particle. This is not a significant 
improvement in pure gravity simulations, 
where the fraction of particles that leave a 
processor at each step is small (~ 5%), but 
is important in a future implementation of 
the code that will incorporate smoothed 
particle hydrodynamic (Dave et al. 1997, 
Springel et al. 2000). In such a scheme gas 
properties of a particle are estimated by av- 
eraging over a number of neighbors of the 
particle; in a parallel implementation an ef- 
ficient location of the particle neighbors lo- 
cated in the other processors is important 
in order to improve the code performances. 

3. Performances 

The treecode described here uses the BH 
tree algorithm to compute the gravitational 
forces. The implementation of this algo- 
rithm is identical to that of Dubinski (1996) 
and Dave et al. (1997). The dependence 
of the errors in the force evaluations on a 
number of input parameters has been dis- 
cussed previously by these authors, there- 
fore an error analysis of the forces will not 
be presented here. 

3.1. Scalability 

The computational speed of the code is de- 
fined as the particle number divided by the 
elapsed CPU wall-clock time t so i ve spent 
in the force computation of the particles. 
For a specified accuracy and particle dis- 
tribution, the CPU time t so i ve of a paral- 
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Fig. 1. The averaged elapsed CPU time t so i ve spent in the force evaluation as a function 
of the number of processors P. The value of the opening angle parameter is 9 = 0.4, 
quadrupole moments are taken into account. The tests have been performed on an IBM 
SP3 machine, the error bars are the dispersions over the P processors. The total particle 
number of a test is N p = 32 3 P. The dashed line is the CPU time for constructing the tree 
scaled up by a factor 10. The initial particle configuration is determined from a uniform 
distribution perturbed according to a CDM power spectrum (see text). The dashed line 
is the expected t so i ve from the ideal scaling relation cx log(N p ). 



lei treecode with maximum theoretical effi- 
ciency is a fraction 1 jP of that of the serial 
code. 

The scalability of the parallel treecode 
has been tested by measuring t so i ve us- 
ing a different number of processors P. 
The initial positions of 32 3 P particles in 
a L = \\.\\h~ l Mpc comoving box have 
been perturbed according to a CDM model 
with Q m = 1, h = 0.5 and power spec- 
trum normalization as = 0.7 at the present 
epoch. The value of the opening angle is 
9 = 0.4 and forces have been computed at 
rcdshift z = 39. The number of particles 
N p = 32 3 P scales linearly with P. This 
dependence of N p on P has been chosen 



in order to consistently compare the force 
solving CPU time t so i ve with the one nec- 
essary for the construction of the local es- 
sential tree. With the choice N p /P — const 
the CPU time t so i ve scales ideally as oc 
log-ZVg oc log P. The results are shown in 
Fig. [I], where t so i ve is plotted (continuous 
line) up to P = 32. For a configuration of 
P processors t so i ve is defined as the average 
of the values of the individual processors. 
The tests have been performed on an IBM 
SP3 machine. The dotted line shows the 
ideal scaling relation. In the large P limit 
tsoive is approximately 10% higher than the 
ideal scaling relation. This is probably due 
to cache effects of the machine which arise 
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when N p ^ 10 5 during the tree descent nec- 
essary to calculate the forces. An impor- 
tant result is the time t tree required to con- 
struct the local essential trees. The dashed 
line of Fig. |l| shows that this time is al- 
ways a small fraction ( ^ 5%) of the time 
required to compute the gravity. It is worth 
stressing that the communication part is 
efficiently handled by the all-to-all routine 
and the corresponding time is a negligible 
fraction of t tre e- The computational speed 
is ~ 32 3 /t so i ve ~ 500P part/ sec. This is 
valid for 9 = 0.4. If 9 is increased the parti- 
cle interaction list will have a smaller num- 
ber of terms and t so i ve will be smaller. It 
has been found that t so i ve (9) is well ap- 
proximated by t solve (6 < 1) oc l(T 5e / 3 . If 
9 = 1 the CPU times of Fig. g will then 
be reduced by a factor ~ 10. A compari- 
son of the code performances with those of 
other authors is difficult because of differ- 
ent algorithms, particle distributions and 
machines. An educated guess that the code 
has performances fairly comparable with 
those of the Springel e al. code (2000) is 
given by Fig. 12 of their paper. In this fig- 
ure the gravity speed as a function of the 
processor number is shown for a cosmo- 
logical hydrodynamic SPH simulation in a 
comoving box size of 50/i _1 Mpc with 32 3 
dark matter particles and 32 3 gas particles. 
The cosmology is given by a ACDM model 
with Q m — 0.3 and h — 0.7. The simu- 
lations are evolved from an initial redshift 
Zi = 10. The plotted speeds have been mea- 
sured on a CRAY T3E {300MHz clock). 
From Fig. 12 the computational speed of 
gravity for P = 32 is ~ 7500part/sec. The 
cosmological model is not that adopted in 
the tests of Fig. g, but at early redshifts the 
computational cost of the gravity force cal- 
culation is not strongly dependent on the 
assumed model. For P = 32 and 9 = 0.4 
the results of Fig. g give a gravity speed of 
~ l2-10 3 part/sec for a CDM model at m = 
39. The measured speed must be reduced 
by ~ 20% to take into account the higher 
clock rate of the IBM SP3 (375MHz). The 
final value (~ 9500part / sec) is similar to 
the one obtained by Springel et al. (2000). 



3.2. Load balancing 

An important characteristic of a treecode is 
load balancing. An ideal code should have 
the computational work divided evenly be- 
tween the processors. This is not always 
possible and code performances will be de- 
graded when the load unbalancing between 
the processors becomes large. At any point 
of the code where synchronous communica- 
tions are necessary there will be P — 1 pro- 
cessors waiting for the most heavily loaded 
one to complete its computational work. 
Load balancing can then be measured as 



t P )/t 



p ) I L max t 



(8) 



where t p is the CPU time spent by the 
processor p to complete a procedure and 
tmax is the maximum of the times t p . A 
treecode spends most of the CPU time in 
computing gravitational forces, and so it 
is essential to have good load balancing 
( ^, 90%) with the gravity routine. This 
task is not obviously achieved with a mul- 
tistep treecode. The number of active par- 
ticles for which it is necessary to com- 
pute the gravitational forces at t^ varies 
wildly with the timesteps. The current sim- 
ulation time tn is defined k steps after 
t n = nAt as t$ = t n + YjjZi At,-, the 
summation is at t > t n over the past 
timesteps Atj. At a certain step the ORB 
procedure described in sect. 2.1 can be 
used to obtain load balancing, but at later 
steps the unbalancing can be substantial. 
This problem has prompted some authors 
to consider more complicated approaches 



( Bpringcl, Yoshida fc White 2001 



fe Capuzzo-Dolcetta 2002). Here a sim 



Miocchi 



pier route is followed which starts from 
the observation that in a multistep inte- 
gration scheme, a better measure of the 
computational work done by each parti- 
cle i is given by Wi = J2k > w here 
oc Nop{i) is the number of float- 
ing point operations of particle i necessary 
to calculate the gravitational forces of the 
particle at the simulation time tn . If the 



R. Valdarnini: Parallel treecode 




0123456789 10 
j— th timestep 

I I I I I I I 1 I I 




0123456789 
j— th timestep 



Fig. 2. (a): Number of particles with timesteps Atj = Ato/2 3 at the end of a macrostep 
Ato- The simulation is that of sect. 3 with N p = 2 ■ 32 3 particles, Ato = i/m/424, 
At m i n — Aio/32, and simulation time t n = nAto = 225 Ato. (b): For the same particle 
distribution the corresponding computational loads — Y^ieAt- W» are snown i the 

summation is over all the particles i with timesteps Atj. The individual particle weight 

used in the ORB is Wi = J2k w i > nere w i ^ s t ne single-step particle work at the k — th 
step after t n and the summation is over the steps between t n and f n +i. 



particle i is not active at t„ , iu* = 0. 
The summation is over the steps between 
t n and t n + Ato, the weights Wi are now 
used at each large step Ato to subdivide 
the computational volume according to the 
ORB procedure. Theoretically this weight- 
ing scheme does not guarantee a perfect 
load balance, nonetheless it has been found 
to yield satisfactory results (L £ 90%) in 
many typical applications. The reason lies 
in the shape of the distribution function 
F(Ati) of the particle timesteps At;, for a 
simulation with an evolved clustering state. 
The number of particles with timesteps 
in the interval Atj , Atj + At is given by 
rij = FAt. The particle timesteps are de- 
termined according to the criteria defined 



in sect. 2.4; another parameter which deter- 
mines the shape of the distribution function 
is the minimum timestep Ai m j n . 

The optimal choice for At m i n requires 
that the number of particles of the binned 
distribution rij in the last time bin should 
be a small fraction of the total particle 
number (N opt ~ 10%jV p ). The distribu- 
tion rij of a test simulation is shown as 
a function of the particle timesteps Atj in 
Fig. ||a at the end of a large timestep Ato, 
when the particle positions are synchro- 
nized. The simulation is that of sect. 3.1 
with N p = 2-32 3 particles, At a = i/ in /424, 
Atmin = Ato/32 and simulation time t n = 
n At = 225 Ai . 
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Fig. 3. The load balancing scheme is tested for a parallel run with P = 4 processors. 
The simulation is that of Fig. @. (a): The top panel shows, between t n and t n +i, the 

(k) 

maximum of the CPU times of the P processors at the simulation time i„ , n = 225. 
The corresponding load balancing Lk is plotted in the mid panel (b) . The bottom panel 
shows the load balancing at the end of each macrostep Aio- The goodness of the weighting 
scheme is shown by the first point, where the ORB procedure has been performed setting 
Wi = const. 



The corresponding distribution of par- 
ticle computational loads Wi is shown 
in panel (b). The plotted distribution is 
— ^\ gAt Wi, the summation being 
over all of the particles i with timesteps 
Atj. About ~ 90% of the particles are in 
the first three time bins, it can be seen that 
for these bins the variations in the load dis- 
tribution are within ~ 20%. For example 
the number of particles rij with timcstcp 
At 3 = At a /8 is ~ no/10. The choice of 
a simple weighting scheme Wi cx Nop{i) 
would have given a shape of the load dis- 
tribution similar to that of nj. The rea- 
son for the shape of the load distribution 
of Fig. |^b is that in a multistep integra- 
tion scheme the particle forces are calcu- 



lated within a large timestep Ato when 
their positions must be synchronized. An 
optimal choice of the constraints on the 
particle timesteps yields a binned distri- 
bution rij with a hierarchy n^+x ~ %'/2. 
A weighing scheme that sums the number 
of floating point operations over a large 
timcstcp Aio takes into account the fact 
there are few particles with Atj <C Ai 
but that these particles have forces cal- 
culated a number of times cx Atj 1 . This 
weighting scheme leads, at the end of a 
large timestep Ato, to particle loads with 
a distribution which can be considered for 
practical purposes roughly constant for a 
large fraction of the simulation particles 
(~ 90%). An ORB domain decomposition 
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is then applied every large timestep Ato 
according to the calculated weights of the 
particles. The subdivision of the compu- 
tational load that follows from this ORB 
among the processors it is still unbalanced, 
but within a large timestep Aio the unbal- 
ancing is higher when the computational 
work is minimal. This is clearly illustrated 
in the example of Fig. ||. The simulation 
of Fig. H has been used at the same simu- 
lation time, with N p = 2 ■ 32 3 particles di- 
vided among P — 4 processors. The proces- 
sor domains have been found as previously 
discussed. Panel (a) shows the elapsed CPU 
wall-clock time spent by the parallel code 
to compute the gravitational forces. The 
CPU time is plotted between t n and t n+ i 
versus the simulation time t„ and is the 
maximum of the single processor values. 
There is a large burst of CPU work when 
the particles synchronized at t n k ^ are those 
with timesteps Ato, Ato/2 and Ato/ 4. The 
instantaneous load balancing Lr^) is calcu- 
lated using Eq. || between tn* 1 and t^ +1 \ It 
can be seen that drops to very ineffi- 
cient values ( ^ 0.3) when tn corresponds 
to a small number of active particles and it 
reaches a high efficiency ( ^ 0.9) with the 
highest CPU times. The overall load bal- 
ancing is measured by applying Eq. || be- 
tween every ORB domain decomposition: 
panel (c) shows < L > versus the simula- 
tion time t n for ten large timesteps Ato- To 
show how well the method is working the 
ORB procedure has been performed set- 
ting for the first step Wi = const, yielding 
< L >~ 0.5 . This proves that the load 
balancing performances are sensitive to the 
chosen weighting scheme and that the pro- 
cedure previously described is optimal to 
achieve a good load balance for the parallel 
tree code described here. 

Preliminary results show that this 
weighting scheme for the ORB domain de- 
composition can still be successfully used to 
obtain an efficient load balancing ( £ 90%) 
when the number of processors is higher 
than that of the tests performed here (P = 
32). This confirms that the load balancing 



efficiency of the adopted weighting scheme 
is robust for a variety of clustering states 
in cosmological simulations. 
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